ABSTRACT

The Internet has enabled the creation of a growing num- 
ber of large-scale knowledge bases in a variety of domains 
containing complementary information. Tools for automat- 
ically aligning these knowledge bases would make it possi- 
ble to unify many sources of structured knowledge and an- 
swer complex queries. However, the efficient alignment of 
large-scale knowledge bases still poses a considerable chal- 
lenge. Here, we present Simple Greedy Matching (SiGMa), 
a simple algorithm for aligning knowledge bases with mil- 
lions of entities and facts. SiGMa is an iterative propaga- 
tion algorithm which leverages both the structural informa- 
tion from the relationship graph as well as flexible similarity 
measures between entity properties in a greedy local search, 
thus making it scalable. Despite its greedy nature, our ex- 
periments indicate that SiGMa can efficiently match some 
of the world's largest knowledge bases with high precision. 
We provide additional experiments on benchmark datasets 
which demonstrate that SiGMa can outperform state-of-the- 
art approaches both in accuracy and efficiency. 

1. INTRODUCTION 

In the last decade, a growing number of large-scale knowledge
bases have been created online. Examples of domains include
music, movies, publications and biological data 1. As these
knowledge bases sometimes contain both overlapping and
complementary information, there has been growing interest in
attempting to merge them by aligning their common elements.
This alignment could have important uses for information
retrieval and question answering. For example, one could be
interested in finding a scientist with expertise on certain
related protein functions - information which could be
obtained by aligning a biological database with a publication
one. Unfortunately, this task is challenging to automate, as
different knowledge bases generally use different terms to
represent their entities, and the space of possible matchings
grows exponentially with the number of entities.

1 Such as MusicBrainz, IMDb, DBLP and UniProt.

A significant amount of research has been done in this area 
- particularly under the umbrella term of ontology matching 
[6, 15, 8]. An ontology is a formal collection of world knowl- 
edge and can take different structured representations. In 
this paper, we will use the term knowledge base to empha- 
size that we assume very little structure about the ontology 
(to be specified in Section 2) . Despite the large body of lit- 
erature in this area, most of the work on ontology matching 
has been demonstrated only on fairly small datasets of the 
order of a few hundred entities. In particular, Shvaiko and 
Euzenat [24] identified large-scale evaluation as one of the 
ten challenges for the field of ontology matching. 

In this paper, we consider the problem of aligning the in- 
stances in large knowledge bases, of the order of millions of 
entities and facts, where aligning means automatically iden- 
tifying corresponding entities and interlinking them. Our 
starting point was the challenging task of aligning the movie 
database IMDb to the Wikipedia-based YAGO [27], as an- 
other step towards the Semantic Web vision of interlinking 
different sources of knowledge which is exemplified by the 
Linking Open Data Initiative 2 [4]. Initial attempts to match 
IMDb entities to YAGO entities by naively exploiting string 
and neighborhood information failed, and so we designed 
SiGMa (Simple Greedy Matching), a scalable greedy iterative 
algorithm which is able to exploit previous matching deci- 
sions as well as the relationship graph information between 
entities. 

Two design goals shaped SiGMa: to take advantage of the
combinatorial structure of the matching problem (in contrast
with database record linkage approaches, which make more
independent decisions), and to keep the approach simple enough
to scale. SiGMa works in two stages: it first starts with a
small seed matching assumed to be of good quality. Then the
algorithm incrementally augments the matching by using both
structural information and properties of entities, such as
their string representation, to define a modular score
function. Some key aspects of the algorithm are that (1) it
uses the current matching to obtain structural information,
thereby harnessing information from previous decisions; (2) it
proposes candidate matches in a local manner, from the
structural information; and (3) it makes greedy decisions,
enabling a scalable implementation. A surprising result is that
we obtained accurate large-scale matchings in our experiments
despite the greediness of the algorithm.

2 http://linkeddata.org/

Contributions. The contributions of the present work are 
the following: 

1. We present SiGMa, a knowledge base alignment algo- 
rithm which can handle millions of entities. The al- 
gorithm is easily extensible with tailored scoring func- 
tions to incorporate domain knowledge. It also pro- 
vides a natural tradeoff between precision and recall, 
as well as between computation and recall. 

2. In the context of testing the algorithm, we constructed 
two large-scale partially labeled knowledge base align- 
ment datasets with hundreds of thousands of ground 
truth mappings. We expect these to be a useful re- 
source for the research community to develop and eval- 
uate new knowledge base alignment algorithms. 

3. We provide a detailed experimental comparison illus- 
trating how SiGMa improves over the state-of-the-art. 
SiGMa is able to align knowledge bases with millions 
of entities with over 95% precision in less than two 
hours (a 50x speed-up over [26]). On standard bench- 
mark datasets, SiGMa obtains solutions with higher 
F-measure than the best previously published results. 

The remainder of the paper is organized as follows. Sec- 
tion 2 presents the knowledge base alignment problem with 
a real-world example as motivation for our assumptions. We 
describe the algorithm SiGMa in Section 3. We evaluate it 
on benchmark and on real-world datasets in Section 4, and 
situate it in the context of related work in Section 5. 

2. ALIGNING LARGE-SCALE KNOWLEDGE BASES

2.1 Motivating example: YAGO and IMDb

Consider merging the information in the following two 
knowledge bases: 

1. YAGO, a large semantic knowledge base derived from 
English Wikipedia [27], WordNet [9] and GeoNames. 3 

2. IMDb, a large popular online database that stores in- 
formation about movies. 4 

The information in YAGO is available as a long list of triples 
(called facts) that we formalize as: 



$$\langle e, r, e' \rangle,$$  (1)

which means that the directed relationship r holds from entity
e to entity e', such as $\langle$John_Travolta, ActedIn, Grease$\rangle$.

    YAGO                  IMDb
    --------------------  --------------------
    actedIn               actedIn
    directed              directed
    produced              produced
    created               composed
    hasLabel*             hasLabel*
    wasCreatedOnDate*     hasProductionYear*

Table 1: Manually matched relations between YAGO and IMDb. The
starred pairs are actually pairs of properties, as defined in
the text.

3 http://www.geonames.org/
4 http://www.imdb.com/



The information from IMDb was originally available as sev- 
eral files which we merged into a similar list of triples. We 
call these two databases knowledge bases to emphasize that 
we are not assuming a richer representation, such as RDFS 
[29], which would distinguish between classes and instances 
for example. In the language of ontology matching, our 
setup is the less studied instance matching problem, as point- 
ed out by Castano et al. [5], for which the goal is to match 
concrete instantiations of concepts such as specific actors 
and specific movies rather than the general actor or movie 
class. YAGO comes with an RDFS representation, but not 
IMDb; therefore we will focus on methods that do not as- 
sume or require a class structure or rich hierarchy in order 
to find a one-to-one matching of instances between YAGO 
and IMDb. 

We note that in the full generality of the ontology match- 
ing problem, both the schema and the instances of one on- 
tology are to be related with the ones of the other ontology. 
Moreover, in addition to the isSameAs (or "=") relationship
that we consider, these matching relationships could be
isMoreGeneralThan ("$\supseteq$"), isLessGeneralThan
("$\subseteq$") or even hasPartialOverlap. In our example,
because the number of relations in the knowledge bases is
relatively small (108 in YAGO and 10 in IMDb), we could align
the relations manually, discovering six equivalent ones as
listed in Table 1. As we will see in our experiments, focusing
solely on the isSameAs type of relationship between instances
of the two knowledge bases is sufficient in the YAGO-IMDb setup
to cover most cases. The exceptions are rare enough for SiGMa
to obtain useful results while making the simplifying
assumption that the alignment between the instances is
injective (1-1).

Relationships vs. properties. Given our assumption that 
the alignment is 1-1, it is important to distinguish between 
two types of objects which could be present in the list of 
triples: entities vs. literals. By our definition, the entities 
will be the only objects that we will try to align - they will 
be objects like specific actors or specific movies which have 
a clear identity. The literals, on the other hand, will corre- 
spond to a value related to an entity through a special kind 
of relationship that we will call property. The defining char- 
acteristic of literals is that it would not make sense to try to 
align them between the two knowledge bases in a 1-1 fashion. 
For example, in the YAGO triple (m1, wasCreatedOnDate,
1999-12-11), the object 1999-12-11 could be interpreted as a
literal representing the value of the property
wasCreatedOnDate for the entity m1. The corresponding property
in our version of IMDb is hasProductionYear, which has values
only at the year granularity (1999). The 1-1 restriction would
prevent us from aligning both 1999-12-11 and 1999-12-10 to
1999. On the other hand, we can use these literals to define a
similarity score between entities from the two knowledge bases
(for example in this case, whether the year matches, or how
close the dates are to each other). We will thus have two types
of triples: entity-relationship-entity and
entity-property-literal. We assume that the distinction between
relationships and properties (which depends on the domain and
the user's goals) is easy to make; for example, in the Freebase
dataset that we also used in our experiments, the entities
would have unique identifiers but not the literals. Figure 1
provides a concrete example of information present in the two
knowledge bases that we will keep re-using in this paper.

We are now in a position to state more precisely the prob- 
lem that we address. 

Definition: A knowledge base KB is a tuple
$(\mathcal{E}, \mathcal{L}, \mathcal{R}, \mathcal{P}, \mathcal{F}_R, \mathcal{F}_P)$
where $\mathcal{E}$, $\mathcal{L}$, $\mathcal{R}$ and
$\mathcal{P}$ are sets of entities, literals, relationships and
properties respectively;
$\mathcal{F}_R \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$
is a set of relationship-facts whereas
$\mathcal{F}_P \subseteq \mathcal{E} \times \mathcal{P} \times \mathcal{L}$
is a set of property-facts (both can be represented as a simple
list of triples). To simplify the notation, we assume that all
inverse relations are also present in $\mathcal{F}_R$ - that
is, if $\langle e, r, e' \rangle$ is in $\mathcal{F}_R$, we
also have $\langle e', r^{-1}, e \rangle$ in $\mathcal{F}_R$,
effectively doubling the number of possible relations in the
KB. 5

Problem: one-to-one alignment of instances between two
knowledge bases. Given two knowledge bases $KB_1$ and $KB_2$ as
well as a partial mapping between their corresponding
relationships and properties, we want to output a 1-1 partial
mapping $m$ from $\mathcal{E}_1$ to $\mathcal{E}_2$ which
represents the semantically equivalent entities in the two
knowledge bases (by partial mapping, we mean that the domain of
$m$ does not have to be the whole of $\mathcal{E}_1$).
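
As an illustration of the definition above, the following minimal Python sketch stores a knowledge base as two sets of triples and adds the inverse of every relationship-fact. The names `KnowledgeBase`, `build_kb` and `is_one_to_one`, and the `"^-1"` suffix used to denote an inverse relation, are assumptions made for this example and do not come from the SiGMa implementation.

```python
from collections import namedtuple

# Hypothetical container; the field names are illustrative only.
KnowledgeBase = namedtuple(
    "KnowledgeBase", "entities relationship_facts property_facts")


def build_kb(relationship_triples, property_triples):
    """Build a KB from (e, r, e') and (e, p, literal) triples.

    Every relationship-fact (e, r, e') is stored together with its
    inverse (e', r^-1, e), as assumed in the definition.
    """
    rel_facts = set()
    entities = set()
    for e, r, e2 in relationship_triples:
        rel_facts.add((e, r, e2))
        rel_facts.add((e2, r + "^-1", e))  # inverse fact
        entities.update((e, e2))
    prop_facts = set(property_triples)
    # Literals are deliberately not added to the entity set.
    entities.update(e for e, _, _ in prop_facts)
    return KnowledgeBase(entities, rel_facts, prop_facts)


def is_one_to_one(mapping):
    """Check that a partial mapping {entity1: entity2} is injective."""
    return len(set(mapping.values())) == len(mapping)
```

A partial mapping is represented here as a plain dictionary from entities of the first KB to entities of the second; entities absent from the dictionary are simply left unmatched.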

2.2 Possible approaches 

Standard approaches for the ontology matching problem, 
such as RiMOM [18], could be used to align small knowl- 
edge bases. However, they do not scale to millions of entities 
as needed for our task given that they usually consider all 
pairs of entities, suffering from a quadratic scaling cost. On 
the other hand, the related problem of identifying duplicate
entities, known as record linkage or duplicate detection in the
database field and as co-reference resolution in the natural
language processing field, does have scalable solutions
[1, 11], though these do not exploit the 1-1 matching
combinatorial structure present in our task, which reduces
their accuracy.
More specifically, they usually make independent decisions 
for different entities using some kind of similarity function, 
rather than exploiting the competition between different as- 
signments for entities. A notable exception is the work on 
collective entity resolution by Bhattacharya and Getoor [3], 
solved using a greedy agglomerative clustering algorithm. 
The algorithm SiGMa that we present in Section 3 can ac- 
tually be seen as an efficient specialization of their work to 
the task of knowledge base alignment. 

Another approach to alignment arises from the word alignment
problem in natural language processing [21], which has been
formulated as a maximum weighted bipartite matching problem
[28] (thus exploiting the 1-1 matching structure). It has also
been formulated as a quadratic assignment problem in [16],
which encourages neighbor entities in one graph to align to
neighbor entities in the other graph, thus enabling alignment
decisions to depend on each other (see the caption of Figure 1
for an example of this in our setup). The quadratic assignment
formulation [17], which can be solved as an integer linear
program, is NP-hard in general though, and these approaches
were only used to align at most one hundred entities. In the
algorithm SiGMa that we propose, we are interested in
exploiting both the 1-1 matching constraint, as well as
building on previous decisions, like these word alignment
approaches, but in a scalable manner which would handle
millions of entities. SiGMa does this by greedily optimizing
the quadratic assignment objective, as we will describe in
Section 3.1. Finally, Suchanek et al. [26] recently proposed an
ontology matching approach called PARIS that they have also
successfully applied to the alignment of YAGO to IMDb, though
the scalability of their approach is not as clear, as we will
explain in Section 5. We will provide a detailed comparison
with PARIS in the experiments section.

5 This allows us to look at only one standard direction of
facts and cover all possibilities - see for example how it is
used in the definition of compatible-neighbors in (4).

[Figure 1 panels: YAGO, IMDb]

Figure 1: Example of neighborhood to match in YAGO and IMDb.
Even though entities i and j have no words in common, the fact
that several of their respective neighbors are matched together
is a strong signal that i and j should be matched together.
This is a real example from the dataset used in the experiments
and SiGMa was able to correctly match all these pairs (i and j
are actually the same movie despite their different stored
titles in each KB).

2.3 Design choices and assumptions 

Our main design choices result from our need for a fast 
algorithm for knowledge base alignment which scales to mil- 
lions of entities. To this end we made the following assump- 
tions: 

1-1 matching and uniqueness. We assume that the 
true alignment between the two KBs is a partial function 
which is mainly 1-1. If there are duplicate entities inside 
a KB, SiGMa will only align one of the duplicates to the 
corresponding entity in the other KB. 

Aligned relationships. We assume that we are given a 
partial alignment between relationships and between prop- 
erties of the KBs. 

3. THE SIGMA ALGORITHM 

3.1 Greedy optimization of a quadratic assignment objective

The SiGMa algorithm can be seen as the greedy optimization of
an objective function which globally scores the suitability of
a particular matching m for a pair of given KBs. This objective
function will use two sources of information useful to choose
matches: a similarity function between pairs of entities
defined from their properties; and a graph neighborhood
contribution making use of neighbor pairs being matched (see
Figure 1 for a motivation). Let us encode the matching
$m : \mathcal{E}_1 \to \mathcal{E}_2$ by a matrix $y$ with
entries indexed by the entities in each KB, with $y_{ij} = 1$
if $m(i) = j$, meaning that $i \in \mathcal{E}_1$ is matched to
$j \in \mathcal{E}_2$, and $y_{ij} = 0$ otherwise. The space of
possible 1-1 partial mappings is thus represented by the set of
binary matrices:

$$\mathcal{M} = \Big\{ y \in \{0,1\}^{\mathcal{E}_1 \times \mathcal{E}_2} : \sum_{l} y_{il} \le 1 \ \forall i \in \mathcal{E}_1 \ \text{and} \ \sum_{k} y_{kj} \le 1 \ \forall j \in \mathcal{E}_2 \Big\}.$$

We define the following quadratic objective function which
globally scores the suitability of a matching $y$:

$$\mathrm{obj}(y) = \sum_{(i,j) \in \mathcal{E}_1 \times \mathcal{E}_2} y_{ij} \left[ (1 - \alpha)\, s_{ij} + \alpha\, g_{ij}(y) \right], \quad \text{where} \quad g_{ij}(y) = \sum_{(k,l) \in \mathcal{N}_{ij}} y_{kl}\, w_{ij,kl}.$$  (2)

The objective contains linear coefficients $s_{ij}$ which
encode a similarity between entity $i$ and $j$, as well as
quadratic coefficients $w_{ij,kl}$ which control the
algorithm's tendency to match $i$ with $j$ given that $k$ was
matched to $l$ 6. $\mathcal{N}_{ij}$ is a local neighborhood
around $(i, j)$ that we define later and which will depend on
the graph information from the KBs - $g_{ij}(y)$ is basically
counting (in a weighted fashion) the number of matched pairs
$(k, l)$ which are in the neighborhood of $i$ and $j$.
$\alpha \in [0, 1]$ is a tradeoff parameter between the linear
and quadratic contributions. Our approach is motivated by the
maximization problem:

$$\max_{y} \ \mathrm{obj}(y) \quad \text{s.t.} \quad y \in \mathcal{M}, \ \|y\|_1 \le R,$$  (3)

where the norm $\|y\|_1 = \sum_{ij} y_{ij}$ represents the
number of elements matched and $R$ is an unknown upper-bound
which represents the size of the best partial mapping which can
be made from $KB_1$ to $KB_2$. We note that if the coefficients
are all positive (as will be the case in our formulation - we
are only encoding similarities and not repulsions between
entities), then the maximizer $y^*$ will have $\|y^*\|_1 = R$.
Problem (3) is thus related to one of the variations of the
quadratic assignment problem, a well-known NP-complete problem
in operational research [17] 7. Even though one could
approximate the solution to the combinatorial optimization (3)
using a linear program relaxation (see Lacoste-Julien et al.
[16]), the number of variables is quadratic in the number of
entities, so this relaxation does not scale to millions of
entities. Our approach is instead to greedily optimize (3) by
adding, at each iteration, the match element $y_{ij} = 1$ which
increases the objective the most, selected amongst a small set
of possibilities. In other words, the high-level operational
definition of the SiGMa algorithm is as follows:

1. Start with an initial good quality partial match $y_0$.

2. At each iteration $t$, augment the previous matching with a
   new matched pair by setting $y_{ij} = 1$ for the $(i, j)$
   which maximally increases obj, chosen amongst a small set
   $S_t$ of reasonable candidates which preserve the
   feasibility of the new matching.

3. Stop when the bound $\|y\|_1 = R$ is reached (and never undo
   previous decisions).

6 In the rest of this paper, we will use the convention that
$i$ and $k$ are always entities in $KB_1$, whereas $j$ and $l$
are in $KB_2$. $e$ could be in either KB.

7 See Appendix C for the traditional description of the
quadratic assignment problem and its relationship to our
problem.

Having outlined the general framework, in the remainder of this
section we will describe methods for choosing the similarity
coefficients $s_{ij}$ and $w_{ij,kl}$ so that they guide the
algorithm towards good matchings (Section 3.3), the choice of
neighbors $\mathcal{N}_{ij}$, the choice of a candidate set
$S_t$, and the stopping criterion $R$. These choices influence
both the speed and accuracy of the algorithm.

Compatible-neighbors. $\mathcal{N}_{ij}$ should be chosen so as
to respect the graph structure defined by the KB facts. Its
contribution in the objective crucially encodes the fact that a
neighbor $k$ of $i$ being matched to a 'compatible' neighbor
$l$ of $j$ should encourage $i$ to be matched to $j$ (see the
caption of Figure 1 for an example). Here, compatibility means
that they are related by the same relationship (they have the
same color in Figure 1). Formally, we define:

$$\mathcal{N}_{ij} = \text{compatible-neighbors}(i, j) = \{ (k, l) : \langle i, r, k \rangle \text{ is in } \mathcal{F}_{R1} \text{ and } \langle j, s, l \rangle \text{ is in } \mathcal{F}_{R2} \text{ and relationship } r \text{ is matched to } s \}.$$  (4)

Note that a property of this neighborhood is that
$(k, l) \in \mathcal{N}_{ij}$ iff $(i, j) \in \mathcal{N}_{kl}$,
as we have that the relationship $r$ is matched to $s$ iff
$r^{-1}$ is matched to $s^{-1}$ as well. This means that the
increase in the objective obtained by adding $(i, j)$ to the
current matching $y$ defines the following context dependent
similarity score function which is used to pick the next
matched pair in step 2 of the algorithm:

$$\mathrm{score}(i, j; y) = (1 - \alpha)\, s_{ij} + \alpha\, \delta g_{ij}(y), \quad \text{where} \quad \delta g_{ij}(y) = \sum_{(k,l) \in \mathcal{N}_{ij}} y_{kl} \left( w_{ij,kl} + w_{kl,ij} \right).$$  (5)
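
To make (4) and (5) concrete, here is a minimal Python sketch under stated assumptions: relationship-facts are indexed per entity, `relation_map` is a hypothetical dictionary giving the partial alignment of relationships from the first KB to the second, and `pair_weight(i, j, k, l)` is a caller-supplied stand-in for the coefficient $w_{ij,kl}$. None of these names come from the SiGMa code.

```python
from collections import defaultdict


def index_by_subject(relationship_facts):
    """Map each entity to the list of its (relation, neighbor) pairs."""
    out = defaultdict(list)
    for e, r, e2 in relationship_facts:
        out[e].append((r, e2))
    return out


def compatible_neighbors(i, j, facts1, facts2, relation_map):
    """Equation (4): pairs (k, l) such that (i, r, k) is in KB1,
    (j, s, l) is in KB2 and relationship r is matched to s."""
    by_relation = defaultdict(list)
    for s, l in facts2.get(j, []):
        by_relation[s].append(l)
    return {(k, l)
            for r, k in facts1.get(i, [])
            for l in by_relation.get(relation_map.get(r), [])}


def score(i, j, matching, s_ij, pair_weight, neighbors, alpha):
    """Equation (5): (1 - alpha) * s_ij + alpha * delta_g_ij(y).

    `matching` is the current partial mapping {k: l}; only neighbor
    pairs that are currently matched contribute to delta_g.
    """
    delta_g = sum(pair_weight(i, j, k, l) + pair_weight(k, l, i, j)
                  for k, l in neighbors if matching.get(k) == l)
    return (1 - alpha) * s_ij + alpha * delta_g
```

Because only pairs in the compatible-neighbors set are inspected, the cost of scoring a candidate depends on the local degree of the two entities rather than on the size of the KBs.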

Information propagation on the graph. The compati- 
ble-neighbors concept that we just defined is one of the 
most crucial characteristics of SiGMa. It allows the infor- 
mation of a new matched pair to propagate amongst its 
neighbors. It also defines a powerful heuristic to suggest 
new candidate pairs to include in a small set St of matches 
to choose from: after matching i to j, SiGMa adds all the 
pairs (k, l) from compatible-neighbors(i, j) as new candidates.
This yields a fire propagation analogy for the algorithm:
starting from an initial matching (the fire), SiGMa matches the
neighbors of the pairs already matched, letting the fire
propagate through the graph. If the graph in each KB is
well-connected in a similar fashion, it can visit most nodes
this way. This heuristic enables SiGMa to avoid the potential
quadratic number of pairs to consider by only focusing its
attention on the neighborhoods of current matches.

Stopping criterion. SiGMa terminates when the variation in the
objective value, score(i, j; y), of the latest added match
(i, j) falls below a threshold (or the queue becomes empty).
The threshold in effect controls the precision / recall
tradeoff of the algorithm. By ensuring that the $s_{ij}$ and
$g_{ij}(y)$ terms are normalized between 0 and 1, we can
standardize the scale of the threshold for different score
functions. In our experiments, a threshold of 0.25 is observed
to correlate well with a point at which the F-measure stops
increasing and the precision is significantly decreasing.



 1: Initialize matching $m = m_0$.
 2: Initialize priority queue S of suggested candidate pairs as
    $S_0 \cup \big( \bigcup_{(i,j) \in m} \mathcal{N}_{ij} \big)$
    - the compatible-neighbors of pairs in m, with
    score(i, j; m) as their key.
 3: while priority queue S is not empty do
 4:     Extract (score, i, j) from queue S
 5:     if score < threshold then stop
 6:     if i or j is already matched to some entity then
 7:         skip them and continue loop
 8:     else
 9:         Set m(i) = j.
            {We update candidate lists and scores:}
10:         for (k, l) in $\mathcal{N}_{ij}$ and not already matched do
11:             Add (score(k, l; m), k, l) to queue S.

Table 2: SiGMa algorithm.

3.2 Algorithm and implementation 

We present the pseudo-code for SiGMa in Table 2. We now
elaborate on the algorithm design as well as its implementation
aspects. We note that the score defined in (5) to greedily
select the next matched pair is composed of a static term
$s_{ij}$, which does not depend on the evolving matching $y$,
and a dynamic term $\delta g_{ij}(y)$, which depends on $y$,
though only through the local neighborhood $\mathcal{N}_{ij}$.
We call the $\delta g_{ij}$ component of the score function the
graph contribution - its local dependence means that it can be
updated efficiently after a new match has been added. We
explain in more detail the choice of similarity measures for
these components in Section 3.3.

Initial match structure $m_0$. The algorithm can take any
initial matching seed assumed to be of good quality. In our
current implementation, this is done by looking for entities
with the same string representation (with minimal
standardization such as removing capitalization and
punctuation) with an unambiguous 1-1 match - that is, we do not
include an exact matched pair when more than two entities have
this same string representation, thereby increasing precision.
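
The seed construction described above can be sketched as follows. This is only an illustration under assumptions: `normalize` lowercases and strips ASCII punctuation as one possible reading of "minimal standardization", and `labels1` / `labels2` are hypothetical dictionaries mapping each entity to its string representation.

```python
import string
from collections import defaultdict

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(label):
    """Lowercase, drop ASCII punctuation and collapse whitespace."""
    return " ".join(label.lower().translate(_PUNCT_TABLE).split())


def seed_matching(labels1, labels2):
    """Return {entity1: entity2} for normalized strings that are
    carried by exactly one entity in each KB (unambiguous match)."""
    by_label1, by_label2 = defaultdict(list), defaultdict(list)
    for e, label in labels1.items():
        by_label1[normalize(label)].append(e)
    for e, label in labels2.items():
        by_label2[normalize(label)].append(e)
    return {es1[0]: by_label2[key][0]
            for key, es1 in by_label1.items()
            if len(es1) == 1 and len(by_label2.get(key, [])) == 1}
```

Strings shared by several entities inside one KB are skipped entirely, which trades some recall in the seed for higher precision.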

Increasing score function with local dependence. The score
function has a component $s_{ij}$ which is static (fixed at the
beginning of the algorithm), computed from the properties of
entities such as their string representation, and a component
$\delta g_{ij}(y)$ which is dynamic, looking at how many
neighbors are correctly matched. The dynamic part can only
increase when new neighbors are matched, and only the scores of
neighbors can change when a new pair is matched.

Optional static list of candidates $S_0$. Optionally, we can
initialize S with a static list $S_0$ which only needs to be
scored once, as any score update will come from neighbors
already covered by step 11 of the algorithm. The purpose of
$S_0$ is to increase the possible exploration of the graph when
another strong source of information (which is not from the
graph) can be used. In our implementation, we use an inverted
index built on words to efficiently suggest entities which have
at least two words in common in their string representation as
potential candidates. 8
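
A minimal sketch of such an inverted index is given below. The function name `suggest_candidates` and the inputs `words1` / `words2` (hypothetical dictionaries mapping each entity to the list of words in its string representation) are assumptions for illustration; stop-word removal is assumed to have happened beforehand.

```python
from collections import Counter, defaultdict


def suggest_candidates(words1, words2, min_shared=2):
    """Return the pairs (i, j) whose word sets share at least
    `min_shared` words, using an inverted index over KB2."""
    index = defaultdict(set)            # word -> KB2 entities containing it
    for j, words in words2.items():
        for w in set(words):
            index[w].add(j)
    pairs = set()
    for i, words in words1.items():
        shared = Counter()              # KB2 entity -> number of shared words
        for w in set(words):
            shared.update(index.get(w, ()))
        pairs.update((i, j) for j, c in shared.items() if c >= min_shared)
    return pairs
```

Only entities that co-occur in at least one posting list are ever compared, so the quadratic enumeration of all pairs is avoided as long as very frequent words are excluded.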



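Data structures. We use a binary heap for the priority queue
implementation; insertions will thus be O(log n) where n is the
size of the queue. Because the score function can only increase
as we add new matches, we do not need to keep track of stale
nodes in the priority queue in order to update their scores,
yielding a significant speed-up: a pair whose score has
increased is simply inserted again (step 11 of Table 2), and
its older, lower-scored copy is discarded by the check in
step 6 when it is eventually extracted.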
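
The loop of Table 2 can be sketched in Python with the standard `heapq` module. This is a simplified illustration, not the authors' implementation: `score_fn(i, j, m)` and `neighbors_fn(i, j)` are hypothetical caller-supplied callables standing in for score(i, j; m) and compatible-neighbors(i, j), and entity identifiers are assumed to be mutually comparable (for example strings) so that ties in the heap can be broken.

```python
import heapq


def sigma_match(seed, candidates, score_fn, neighbors_fn, threshold=0.25):
    """Greedy matching loop following Table 2.

    seed         -- dict {entity1: entity2}, the initial matching m_0
    candidates   -- iterable of (i, j) pairs, the optional static list S_0
    """
    m = dict(seed)                  # KB1 entity -> KB2 entity
    matched2 = set(m.values())      # KB2 entities already used
    heap = []

    def push(i, j):
        if i not in m and j not in matched2:
            # heapq is a min-heap, so the negated score is used as key.
            heapq.heappush(heap, (-score_fn(i, j, m), i, j))

    for i, j in candidates:
        push(i, j)
    for i, j in seed.items():       # compatible-neighbors of seed pairs
        for k, l in neighbors_fn(i, j):
            push(k, l)

    while heap:
        neg_score, i, j = heapq.heappop(heap)
        if -neg_score < threshold:
            break
        if i in m or j in matched2:
            continue                # stale entry: one side already matched
        m[i] = j
        matched2.add(j)
        for k, l in neighbors_fn(i, j):
            push(k, l)              # re-inserted with its updated score
    return m
```

Pairs are pushed again whenever a neighbor is matched rather than updated in place; since scores only increase, the freshest copy is always extracted first and older copies are skipped.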

3.3 Score functions 

An important factor for any matching algorithm is the 
similarity function between pairs of elements to match. De- 
signing good similarity functions has been the focus of much 
of the literature on record linkage, entity resolution, etc., and 
because SiGMa uses the score function in a modular fashion, 
SiGMa is free to use most of them for the term $s_{ij}$ as long
as they can be computed efficiently. We provide in this section
our implementation choices (which were motivated by
simplicity), but we note that the algorithm can easily handle
more powerful similarity measures. The generic score function
used by SiGMa was given in (5). In the current implementation,
the static part $s_{ij}$ is defined through the properties of
entities only. The graph part $\delta g_{ij}(y)$ depends on the
relationships between entities (as this is what determines the
graph), as well as the previous matching $y$. We also make sure
that $s_{ij}$ and $g_{ij}$ stay normalized so that the scores
of different pairs are on the same scale.

3.3.1 Static similarity measure 

The static property similarity measure is further decomposed in
two parts: we single out a contribution coming from the string
representation property of entities (as it is such a strong
signal for our datasets), and we consider the other properties
together in a second term:

$$s_{ij} = (1 - \beta)\, \mathrm{string}(i, j) + \beta\, \mathrm{prop}(i, j),$$  (6)

where $\beta \in [0, 1]$ is a tradeoff coefficient between the
two contributions set to 0.25 during the experiments.

String similarity measure. For the string similarity mea- 
sure, we primarily consider the number of words which two 
strings have in common, albeit weighted by their information 
content. In order to handle the varying lengths of strings, 
we use the Jaccard similarity coefficient between the sets of 
words, a metric often used in information retrieval and other 
data mining fields [12, 3]. The Jaccard similarity between sets
$A$ and $B$ is defined as
$\mathrm{Jaccard}(A, B) = |A \cap B| / |A \cup B|$, which is a
number between 0 and 1 and so is normalized as required. We
also add a smoothing term in the denominator in order to favor
longer strings with many words in common over very short
strings. Finally, we use a weighted Jaccard measure in order to
capture the information that some words are more informative
than others. In analogy to a commonly used feature in
information retrieval, we use the IDF
(inverse-document-frequency) weight for each word. The weight
for word $v$ in $KB_o$ is
$w_v^o = \log_{10} \left( |\mathcal{E}_o| / |E_v^o| \right)$,
where $E_v^o = \{ e \in \mathcal{E}_o : e$ has word $v$ in its
string representation$\}$. Combining these elements, we get the
following string similarity measure:

$$\mathrm{string}(i, j) = \frac{\sum_{v \in (W_i \cap W_j)} (w_v^1 + w_v^2)}{\text{smoothing} + \sum_{v \in W_i} w_v^1 + \sum_{v' \in W_j} w_{v'}^2},$$  (7)

where $W_e$ is the set of words in the string representation of
entity $e$ and smoothing is the scalar smoothing constant (we
try different values in the experiments). Using unit weights
and removing the smoothing term would recover the standard
Jaccard coefficient between the two sets. As it operates on
sets of words, this measure is robust to word re-ordering, a
frequently observed variation between strings representing the
same entity in different knowledge bases. On the other hand,
this measure is not robust to small typos or small changes of
spelling of words. This problem could be addressed by using
more involved string similarity measures such as approximate
string matching [10, 25], which handles both word corruption as
well as word reordering, though our current implementation only
uses (7) for simplicity. We will explore the effect of
different scoring functions in our experiments in Section 4.5.

8 To keep the number of suggestions manageable, we exclude a
list of stop words built automatically from the 1,000 most
frequent words of each KB.
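
As an illustration of the string similarity (7), the following Python sketch computes IDF weights per KB and the smoothed weighted Jaccard measure. The function names and the default `smoothing=1.0` are assumptions chosen for the example; the IDF form follows the usual inverse-document-frequency definition, log10 of the number of entities divided by the number of entities containing the word.

```python
import math
from collections import Counter


def idf_weights(words_by_entity):
    """IDF weight per word: log10(|E| / |E_v|), where E_v is the set
    of entities whose string representation contains word v."""
    n = len(words_by_entity)
    df = Counter(w for words in words_by_entity.values() for w in set(words))
    return {w: math.log10(n / c) for w, c in df.items()}


def string_similarity(words_i, words_j, idf1, idf2, smoothing=1.0):
    """Smoothed IDF-weighted Jaccard similarity, as in equation (7).

    words_i / idf1 belong to the first KB, words_j / idf2 to the second.
    With smoothing > 0 the denominator is always positive.
    """
    wi, wj = set(words_i), set(words_j)
    numerator = sum(idf1[v] + idf2[v] for v in wi & wj)
    denominator = (smoothing
                   + sum(idf1[v] for v in wi)
                   + sum(idf2[v] for v in wj))
    return numerator / denominator
```

Setting `smoothing=0.0` gives a similarity of 1 for identical word sets; a positive smoothing constant lowers the score of short strings relative to long strings with the same overlap ratio.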

Property similarity measure. We recall that we assume 
that the user provided a partial matching between prop- 
erties of both databases. This enables us to use them in 
a property similarity measure. In order to elegantly handle 
missing values of properties, varying number of property val- 
ues present, etc., we also use a smoothed weighted Jaccard 
similarity measure between the sets of properties. The de- 
tailed formulation is given in Appendix A for completeness, 
but we note that it can make use of a similarity measure 
between literals such as a normalized distance on numbers (for
dates, years etc.) or a string-edit distance on strings. 

3.3.2 Dynamic graph similarity measure 

We now introduce the part of the score function which enables
SiGMa to build on previous decisions and exploit the
relationship graph information. We need to determine
$w_{ij,kl}$, the weight of the contribution of a neighboring
matched pair $(k, l)$ for the score of the candidate pair
$(i, j)$. The general idea of the graph score function is to
count the number of compatible neighbors which are currently
matched together for a pair of candidates (this is the
$g_{ij}(y)$ contribution in (2)). Going back to the example in
Figure 1, there were three compatible matched pairs shown in
the neighborhood of $i$ and $j$. We would like to normalize
this count by dividing by the number of possible neighbors, and
we would possibly want to weight each neighbor differently. We
again use a smoothed weighted Jaccard measure to summarize this
information, averaging the contribution from each KB. This can
be obtained by defining
$w_{ij,kl} = \gamma_i w_{ik} + \gamma_j w_{jl}$, where
$\gamma_i$ and $\gamma_j$ are normalization factors specific to
$i$ and $j$ in each database and $w_{ik}$ is the weight of the
contribution of $k$ to $i$ in $KB_1$ (and similarly for
$w_{jl}$ in $KB_2$). The graph contribution thus becomes:

$$g_{ij}(y) = \sum_{(k,l) \in \mathcal{N}_{ij}} y_{kl} \left( \gamma_i w_{ik} + \gamma_j w_{jl} \right).$$  (8)

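So let $\mathcal{N}_i$ be the set of neighbors of entity $i$ in
$KB_1$, i.e.
$\mathcal{N}_i = \{ k : \exists r \text{ s.t. } \langle i, r, k \rangle \in \mathcal{F}_{R1} \}$
(and similarly for $\mathcal{N}_j$). Then, remembering that
$\sum_k y_{kl} \le 1$ for a valid partial matching
$y \in \mathcal{M}$, the following normalizations $\gamma_i$
and $\gamma_j$ will yield the average of two smoothed weighted
Jaccard measures for $g_{ij}(y)$:

$$\gamma_i = \frac{1}{2} \Big( 1 + \sum_{k \in \mathcal{N}_i} w_{ik} \Big)^{-1}, \qquad \gamma_j = \frac{1}{2} \Big( 1 + \sum_{l \in \mathcal{N}_j} w_{jl} \Big)^{-1}.$$  (9)

We thus have $g_{ij}(y) \le 1$ for $y \in \mathcal{M}$, keeping
the contribution of each possible matched pair $(i, j)$ on the
same scale in obj in (2).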

The graph part of the score in (5) then takes the form:

$$\delta g_{ij}(y) = \sum_{(k,l) \in \mathcal{N}_{ij}} y_{kl} \left( \gamma_i w_{ik} + \gamma_j w_{jl} + \gamma_k w_{ki} + \gamma_l w_{lj} \right). \qquad (10)$$

The summation over the first two terms yields $g_{ij}(y)$ and
so is bounded by 1, but the summation over the last two
terms could be greater than 1 in the case that $(i,j)$ is filling
a 'hole' in the graph (thus increasing the contribution of
many neighbors $(k,l)$ in obj in (2)). For example, suppose
that $i$ has $n$ neighbors with degree 1 (i.e. they only have $i$
as neighbor), and the same thing for $j$, and that they are
all matched pairwise. Figure 1 is an example of this with
$n = 3$ if we suppose that no other neighbors are present
in the KB. Suppose moreover that we use unit weights for
$w_{ik}$ and $w_{jl}$. Then the normalization is $\gamma_k = 1/4$ for each
$k \in \mathcal{N}_i$ (as they have degree 1), and similarly for $\gamma_l$. The
contribution of the sum over the last two terms in (10) is
thus $n/2$ (whereas in this case $g_{ij}(y) = n/(n + 1) < 1$).
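
The arithmetic of this example can be checked with a small sketch. It assumes the normalization $\gamma^{-1} = 2(1 + \sum w)$ with unit weights, which reproduces the values $\gamma_k = 1/4$, $n/2$ and $n/(n+1)$ quoted above; the function names are illustrative and not taken from the SiGMa code.

```python
from fractions import Fraction


def gamma(weights):
    """Normalization factor 1 / (2 * (1 + sum of neighbor weights))."""
    return Fraction(1, 2 * (1 + sum(weights)))


def star_example(n):
    """Graph scores for i and j having n degree-1 neighbors, all matched pairwise.

    Returns (g_ij, extra), where g_ij sums the first two terms of (10)
    and extra sums the last two terms. Unit weights are assumed.
    """
    gamma_i = gamma([1] * n)  # i has n neighbors
    gamma_j = gamma([1] * n)  # j has n neighbors
    gamma_k = gamma([1])      # each k only has i as neighbor
    gamma_l = gamma([1])      # each l only has j as neighbor
    g_ij = n * (gamma_i + gamma_j)
    extra = n * (gamma_k + gamma_l)
    return g_ij, extra


g_ij, extra = star_example(3)
print(g_ij, extra)  # 3/4 3/2
```

With $n = 3$ as in Figure 1, the first two terms give $3/4$ while the last two give $3/2$, which is why the full graph part can exceed 1.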

Neighbor weight $w_{ik}$. We finally need to specify the
weight $w_{ik}$, which determines the strength of the contribution
of the neighbor $k$ being correctly matched to the score
of a suggested pair containing $i$. In our experiments, we
consider both the constant weight $w_{ik} = 1$ and a weight $w_{ik}$
that varies inversely with the number of neighbors entity $k$
has where the relationship is of the same type as the one
with entity $i$. The motivation for the latter is explained in
Appendix B.

4. EXPERIMENTS 
4.1 Setup 

We made a prototype implementation of SiGMa in Python 9 
and compared its performance on benchmark datasets as 
well as on large-scale knowledge bases. All experiments were 
run on a cluster node Hexacore Intel Xeon E5650 2.66GHz 
with 46GB of RAM running Linux. Each knowledge base 
is represented as two text files containing a list of triples of 
relationships-facts and property-facts. The input to SiGMa 
is a pair of such KBs as well as a partial mapping between 
the relationships and properties of each KB which is used 
in the computation of the score in (5), and the definition of 
compatible-neighbors (4). The output of SiGMa is a list 
of matched pairs $(e_1, e_2)$ with their score information and
the iteration number at which they were added to the so- 
lution. We evaluate the final alignment (after reaching the 
stopping threshold) by comparing it to ground truth using 
the standard metrics of precision, recall and F-measure on 
the number of entities correctly matched. 10 The benchmark 
datasets are available together with corresponding ground 
truth data; for the large-scale knowledge bases, we built 
their ground truth using web url information as described 
in Section 4.2. 
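
As a minimal sketch of this evaluation, the following computes precision, recall and F-measure for a 1-1 alignment. It assumes that only predictions on entities with ground truth information are scored, which is one reading of the setup; the function name and the toy data are hypothetical.

```python
def evaluate(predicted, ground_truth):
    """Score a 1-1 alignment against ground truth.

    predicted, ground_truth: dicts mapping a KB1 entity to a KB2 entity.
    Assumption: predictions on entities without ground truth are ignored.
    """
    scored = {e1: e2 for e1, e2 in predicted.items() if e1 in ground_truth}
    correct = sum(1 for e1, e2 in scored.items() if ground_truth[e1] == e2)
    precision = correct / len(scored) if scored else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


truth = {"a": "x", "b": "y", "c": "z", "d": "w"}
pred = {"a": "x", "b": "q", "e": "r"}  # "e" has no ground truth and is not scored
print(evaluate(pred, truth))
```

Under this reading, the number of scored predictions never exceeds the number of ground truth entities, so recall cannot exceed precision.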

9 The code and datasets will be made available at
http://mlg.eng.cam.ac.uk/slacoste/sigma.
10 Recall is defined in our setup as the number of correctly
matched entities in KB$_1$ divided by the number of entities
with ground truth information in KB$_1$. We note that recall
is upper bounded by precision because our alignment is a
1-1 function.



We found reasonable values for the parameters of SiGMa 
by exploring its performance on the YAGO to IMDb pair 
(the methodology is described in Section 4.5), and then kept 
them fixed for all the other experimental comparisons (Sec- 
tion 4.3 and 4.4). This reflects the situation where one would 
like to apply SiGMa to a new dataset without ground truth 
or to minimize parameter adaptation. The standard pa- 
rameters that we used in these experiments are given in 
Appendix D. 

4.2 Datasets 

Our experiments were done both on several large-scale 
datasets and on some standard benchmark datasets from the 
ontology alignment evaluation initiative (OAEI) (Table 4). 
We describe these datasets below. 

Large-scale datasets. As mentioned throughout this pa- 
per so far, we used the dataset pair YAGO-IMDb as the main 
motivating example for developing and testing SiGMa. We 
also test SiGMa on the pair Freebase-IMDb, for which we 
could obtain a sizable ground truth. We describe here their 
construction. Both YAGO and Freebase are available as lists 
of triples from their respective websites. 11 IMDb, on the 
other hand, is given as a list of text files. 12 There are differ- 
ent files for different categories, e.g.: actors, producers, etc. 
We use these categories to construct a list of triples contain- 
ing facts about movies and people. Because SiGMa ignores 
relationships and properties that are not matched between 
the KBs, we could reduce the size of YAGO and Freebase by 
keeping only those facts which had a 1-1 mapping with IMDb 
as presented in Table 3, and the entities appearing in these 
facts. To facilitate the comparison of SiGMa with PARIS,
the authors of PARIS kindly provided us their own version of
IMDb, which we will refer to from now on as IMDb_PARIS; this
version actually has a richer structure in terms of properties.
We also kept in YAGO the relationships and properties which
were aligned with those of IMDb_PARIS (Table 3). Table 4
presents the number of unique entities and relationship-facts 
included in the relevant reduced datasets. We constructed 
the ground truth for YAGO-IMDb by scraping the relevant 
Wikipedia pages of entities to extract their link to the cor- 
responding IMDb page, which often appears in the 'external 
links' section. We then obtained the entity name by scrap- 
ing the corresponding IMDb page and matched it to our 
constructed database by using string matching (and some 
manual cleaning). We obtained 54K ground truth pairs this 
way. We used a similar process for Freebase-IMDb by access- 
ing the IMDb urls which were actually stored in the database. 
This yielded 293K pairs, probably one of the largest knowledge
base alignment ground truth sets to date.
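
The reduction step described above (keeping only facts whose relation is aligned, together with the entities they mention) could look like the following sketch. The triple layout `(subject, relation, object)`, the function name and the unaligned relation `isMarriedTo` are assumptions made for illustration.

```python
def reduce_kb(triples, aligned_relations):
    """Keep only facts with an aligned relation, and the entities they mention.

    triples: iterable of (subject, relation, object) tuples.
    aligned_relations: set of relation names that have a counterpart in the other KB.
    """
    kept = [(s, r, o) for (s, r, o) in triples if r in aligned_relations]
    entities = {s for s, _, _ in kept} | {o for _, _, o in kept}
    return kept, entities


# Toy facts; "isMarriedTo" stands in for a relation with no IMDb counterpart.
facts = [
    ("Tom_Hanks", "actedIn", "Big"),
    ("Tom_Hanks", "isMarriedTo", "Rita_Wilson"),
    ("Penny_Marshall", "directed", "Big"),
]
kept, entities = reduce_kb(facts, {"actedIn", "directed"})
print(len(kept), sorted(entities))
```

Entities that only appear in dropped facts (here `Rita_Wilson`) disappear from the reduced KB, which is what shrinks the datasets to the sizes reported in Table 4.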

Benchmark datasets. We also tested SiGMa on three bench- 
mark dataset pairs provided by the ontology alignment eval- 
uation initiative (OAEI), which allowed us to compare the 
performance of SiGMa to some previously published meth- 
ods [18, 13]. From the OAEI 2009 edition, 13 we use the 
Rexa-DBLP instance matching benchmark from the domain 



11 YAGO2 core was downloaded from:
http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html and Freebase from:
http://wiki.freebase.com/wiki/Data_dumps.
12 http://www.imdb.com/interfaces#plain
13 http://oaei.ontologymatching.org/2009/instances/



              YAGO               IMDb_PARIS          IMDb       Freebase

Relations     actedIn            actedIn             actedIn    actedIn
              directed           directorOf          directed   directed
              produced           producerOf          produced   produced
              created            writerOf            composed   -
              wasBornIn          bornIn              -          -
              diedIn             deceasedIn          -          -
              capitalOf          locatedIn           -          -

Properties    hasLabel           hasLabel            hasLabel   hasLabel
              wasCreatedOnDate   hasProductionYear   -          initialReleaseDate
              wasBornOnDate      bornOn              -          -
              diedOnDate         deceasedOn          -          -
              hasGivenName       firstName           -          -
              hasFamilyName      lastName            -          -
              hasGender          gender              -          -
              hasHeight          hasHeight           -          -


Table 3: Manually aligned movie related relation- 
ships and properties in large-scale KBs. 



Dataset       #facts   #entities

YAGO          442K     1.4M
IMDb_PARIS    20.9M    4.8M
IMDb          9.3M     3.1M
Freebase      1.5M     171K
DBLP          2.5M     1.6M
Rexa          12.6K    14.7K
person11      500      1000
person12      500      1000
restaurant1   113      339
restaurant2   752      2256




Table 4: Datasets statistics 



of scientific publications. Rexa contains publications and
authors as entities extracted from the search results of the
Rexa search server. DBLP is a version of the DBLP dataset
listing publications from the computer science domain. The
pair has one matched relationship, author, as well as several
matched properties such as year, volume, journal name,
pages, etc. Our goal was to align publications and authors.
The other two datasets come from the Person-Restaurants
(PR) task from the OAEI 2010 edition, 15 containing data
about people and restaurants. In particular, there are
person11-person12 pairs where the second entity is a copy of
the first with one property field corrupted, and
restaurant1-restaurant2 pairs coming from two different online
databases that were manually aligned. All datasets were
downloaded from the corresponding OAEI webpages, with dataset
sizes given in Table 4.

4.3 Exp. 1: Large-scale alignment 

In this experiment, we test the performance of SiGMa
on the three pairs of large-scale KBs and compare it with
PARIS [26], which is described in more detail in the related
work (Section 5). We also compare SiGMa and PARIS
with the simple baseline of doing the unambiguous exact
string matching step described in Section 3.2 which is used
to obtain an initial match $m_0$ (called Exact-string). Table 5
presents the results. Despite its simple greedy nature which
never goes back to correct a mistake, SiGMa obtains an
impressive F-measure above 90% for all datasets, significantly
improving over the Exact-string baseline. We tried running
PARIS [26] on a smaller subset of YAGO-IMDb, using the

14 We note that the smaller eprints dataset also present in the
benchmark was not suitable for 1-1 matchings as its ground
truth had a large number of many-to-one matches.
15 http://oaei.ontologymatching.org/2010/im/index.html



Dataset                     System         Prec   Rec           F           GT size   # pred.   Time

Freebase-IMDb               SiGMa          99     95            97          255k      366k      90 min
                            Exact-string   99     70            82                    244k      1 min

YAGO-IMDb                   SiGMa          98     93            95          54k       188k      50 min
                            Exact-string   99     57            72                    162k      1 min

YAGO-IMDb_PARIS             SiGMa          98     96            97          57k       237k      70 min
(new ground truth)          PARIS          97     [illegible]   97                    702k      3100 min
                            Exact-string   99     56            72                    202k      1 min

YAGO-IMDb_PARIS             SiGMa          98     84            91          11k       237k      70 min
(ground truth from [26])    PARIS          94     90            92                    702k      3100 min
                            Exact-string   99     61            [missing]             202k      1 min




Table 5: Exp. 1: Results (precision, recall, F- 
measure) on large-scale datasets for SiGMa in com- 
parison to a simple exact-matching phase on strings 
as well as PARIS [26]. The 'GT Size' column gives the 
number of entities with ground truth information. Time is
total running time, including loading the dataset (quoted 
from [26] for PARIS). 



code available from its author's website. It did not com- 
plete its first iteration after a week of computation and so we 
halted it (we did not have the SSD drive which seems crucial 
to reasonable running times). The results for PARIS in Ta- 
ble 5 are thus computed using the prediction files provided to 
us by its authors on the YAGO-IMDb_PARIS dataset. In or- 
der to better relate the YAGO-IMDb_PARIS results with the 
YAGO-IMDb ones, we also constructed a larger ground truth 
reference on YAGO-IMDb_PARIS by using the same process 
as described in Section 4.2. On both ground truth evaluations,
SiGMa obtains an F-measure similar to that of PARIS, but in
50x less time. On the other hand, we note that PARIS is
solving the more general problem of instances and schema 
alignment, and was not provided any manual alignment be- 
tween relationships. The large difference of recall between 
PARIS and SiGMa on the ground truth from [26] can be ex- 
plained by the fact that more than a third of its entities 
had no neighbor; whereas the process used to construct the 
new larger ground truth included only entities participating 
in movie facts and thus having at least one neighbor. The 
recall of SiGMa actually increases with the number of
neighbors of an entity (going from 68% for entities in the
ground truth from [26] with 0 neighbors to 97% for entities
with 5+ neighbors).

About 2% of the predicted matched pairs from SiGMa on 
YAGO-IMDb have no word in common and thus zero string 
similarity - difficult pairs to match without any graph in- 
formation. Examples of these pairs came from spelling vari- 
ations of names, movie titles in different languages, foreign 
characters in names which are not handled uniformly or mul- 
tiple titles for movies (such as the 'Blood In, Blood Out' 
example of Figure 1). 

Error analysis. Examining the few errors made by SiGMa,
we observed the following types of matching errors: 1) errors
in the ground truth (either coming from the scraping
scheme used, or from Wikipedia (YAGO) which had incorrect
information); 2) having multiple very similar entities
(e.g. mistaking the 'making of' of the movie vs. the movie
itself); 3) pairs of entities which shared exactly the same
neighbors (e.g. two different movies with exactly the same
actors) but without other discriminating information. Finally,
we note that going through the predictions of SiGMa
that had a low property score revealed a significant number
of errors in the databases (e.g. wildly inconsistent birth
dates for people), indicating that SiGMa could be used to
highlight data inconsistencies between databases.
Dataset     System         Prec   Rec   F    GT size

[earlier rows truncated in the source; last surviving value: 95]
            Exact-string   100    75    86

Rexa-DBLP   SiGMa          97     90    94
            SiGMa-linear   96     86    91   1464
            Exact-string   98     81    89
            RiMOM          80     72    76



Table 6: Exp. 2: Results on the benchmark 
datasets for SiGMa, compared with PARIS [26] and 
RiMOM [18]. SiGMa-linear and Exact-string are also 
included on the interesting datasets as further com- 
parison points. 



4.4 Exp. 2: Benchmark comparisons 

In this experiment, we test the performance of SiGMa on
the three benchmark datasets and compare it with the
best published results so far that we are aware of: PARIS [26]
for the Person-Restaurants datasets (which compared favorably
over ObjectCoref [13]); and RiMOM [18] for Rexa-DBLP.
Table 6 presents the results. We also include the results
for Exact-string as a simple baseline as well as SiGMa-linear,
which is the SiGMa algorithm without using the graph
information at all 16 , to give an idea of how important the graph
information is in these cases.

Interestingly, SiGMa significantly improved the previous 
results without needing any parameter tweaking. The Person- 
Restaurants datasets did not have a rich relationship struc- 
ture to exploit: each entity (a person or a restaurant) was 
linked to exactly one other entity in a 1-1 bipartite fashion (their
address). This is perhaps why SiGMa-linear is surprisingly 
able to perfectly match both the Person and Restaurants 
datasets. Analyzing the errors made by SiGMa, we noticed 
that they were due to a violation of the assumption that 
each entity is unique in each KB: the same address is repre- 
sented as different entities in Restaurant2, and SiGMa greed- 
ily matched the one which was not linked to another restau- 
rant in Restaurant2, thus reducing the graph score for the 
correct match. SiGMa-linear couldn't suffer from this prob- 
lem, and thus obtained a perfect matching. 

The Rexa-DBLP dataset has a more interesting relation- 
ship structure which is not just 1-1: papers have multiple 
authors and authors have written multiple papers, enabling 
the fire propagation algorithm to explore more possibilities. 
However, it appears that a purely string based algorithm 
can already do quite well on this dataset — Exact-string 
obtains an 89% F-measure, already significantly improving
the previously best published results (RiMOM at 76% F- 
measure). SiGMa-linear improves this to 91%, and finally 
using the graph structure helps to improve this to 94%. This 
benchmark which has a medium size also highlights the nice 
scalability of SiGMa: despite using the interpreted language 
Python, our implementation runs in less than 10 minutes on 
this dataset, which can be compared to RiMOM taking 36 
hours on an 8-core server in 2009.



16 SiGMa-linear does not use the graph score component ($\alpha$ is
set to 0) and only uses the inverted index $S_0$ to suggest
candidates, not the neighbors in $\mathcal{N}_{ij}$.
Figure 2: Exp. 3: Precision/Recall curves for SiGMa
on YAGO-IMDb with different scoring configurations. 

The filled circles indicate the maximum F-measure position 
on each curve, with the corresponding diamond giving the 
F-measure value at this recall point. 

4.5 Parameter experiments 

In this section, we explore the role of different configu- 
rations for SiGMa on the YAGO-IMDb pair, as well as de- 
termine which parameters to use for the other experiments. 
We recall that SiGMa with the final parameters (described in 
Appendix D) yields a 95% F-measure on this dataset (second 
section of Table 5). Experiments 5 and 6 which explore the 
optimal weighting schemes as well as the correct stopping 
threshold are described for completeness in Appendix E. 

4.5.1 Exp. 3: Score components 

In this experiment, we explore the importance of each 
part of the score function by running SiGMa with some 
parts turned off (which can be done by setting the $\alpha$ and
$\beta$ tradeoffs to 0 or 1). The resulting precision / recall curves
are plotted in Figure 2a. We can observe that turning off 
the static part of the score (string and property) has the 
biggest effect, decreasing the maximum F-measure from 95% 
to about 80% (to be contrasted with the 72% F-measure for 
Exact-string as shown in Table 5). By comparing SiGMa 
with SiGMa-linear, we see that including the graph infor- 
mation moves the F-measure from a bit below 85% to over 
95%, a significant gain, indicating that the graph structure 
is more important on this dataset than the OAEI benchmark 
datasets. 
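
To make the role of the two tradeoffs concrete, here is a sketch of how the score components might be combined. The convex-combination form below is an assumption (equations (5) and (6) are defined earlier in the paper and should be taken as the reference); it is chosen so that $\alpha = 0$ removes the graph part, as in SiGMa-linear, and it uses the default values $\alpha = 1/3$ and $\beta = 0.25$ from Appendix D.

```python
def static_score(string_sim, prop_sim, beta=0.25):
    """Assumed static part: tradeoff between string and property similarity."""
    return (1 - beta) * string_sim + beta * prop_sim


def total_score(string_sim, prop_sim, graph_score, alpha=1 / 3, beta=0.25):
    """Assumed total score: tradeoff between the static part and the graph part."""
    return (1 - alpha) * static_score(string_sim, prop_sim, beta) + alpha * graph_score


# Turning components off by setting the tradeoffs to 0 or 1:
print(total_score(0.8, 0.4, 0.6))             # all components on
print(total_score(0.8, 0.4, 0.6, alpha=0))    # graph off (SiGMa-linear style)
print(total_score(0.8, 0.4, 0.6, alpha=1))    # static part off, graph only
print(total_score(0.8, 0.4, 0.6, beta=0))     # property off
```

Under this assumed form, each configuration in the experiment corresponds to one choice of the pair ($\alpha$, $\beta$) at the boundary of $[0, 1]$.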

4.5.2 Exp. 4: Matching seed 

In this experiment, we tested how important the size of 
the matching seed $m_0$ is for the performance of SiGMa. We
report the following notable results. We ran SiGMa with no 
exact seed matching at all: we initialized it with a random 
exact match pair and let it explore the graph greedily (with 
the inverted index still making suggestions). This obtained 
an even better score than the standard setup: 99% of pre- 
cision, 94% recall and 96% F-measure, demonstrating that 
a good initial seed is actually not needed for this setup. If 
we do not use the inverted index but initialize SiGMa with 



the top 5% of the exact match sorted by their score in the 
context of the whole exact match, the performance drops a 
little, but SiGMa is still able to explore a large part of the 
graph: it obtains 99% / 87% / 92% of precision/recall/F- 
measure, illustrating the power of the graph information for 
this dataset. 

5. RELATED WORK 

We contrast here SiGMa with the work already mentioned 
in Section 2.2 and provide further links. In the ontology 
matching literature, the only approach which was applied 
to datasets of the size that we considered in this paper is 
the recently proposed PARIS [26], which solves the more 
general problem of matching instances, relationships and 
classes. The PARIS framework defines a normalized score
between pairs of instances to match, representing how likely
they are to be matched, 17 and which depends on the matching
scores of their compatible neighbors. The final scores are
obtained by first initializing (and fixing) the scores on pairs
of literals, and then propagating the updates through the
relationship graph using a fixed point iteration, yielding a
fire propagation of information analogous to that of SiGMa,
though it works with soft [0,1]-valued assignments whereas
SiGMa works with hard {0,1}-valued ones. The authors handle the
scalability issue of maintaining scores for all pairs by using 
a sparse representation with various pruning heuristics (in 
particular, keeping only the maximal assignment for each en- 
tity at each step, thus making the same 1-1 assumption that 
we did). An advantage of PARIS over SiGMa is that it is able 
to include property values in its neighborhood graph (it uses 
soft-assignments between them) whereas SiGMa only uses re- 
lationships given that a 1-1 matching of property values is 
not appropriate. We conjecture that this could explain the 
higher recall that PARIS obtained on entities which had no 
relationship neighbors on the YAGO-IMDb_PARIS dataset.
On the other hand, PARIS was limited to use a 0-1 simi- 
larity measure between property values for the large-scale 
experiments in [26], as it is unclear how one could apply 
the same sparsity optimization in a scalable fashion with 
more involved similarity measures (such as the IDF one that 
SiGMa is using). The use of a 0-1 similarity measure on 
strings could explain the lower performance of PARIS on the 
Restaurants dataset in comparison to SiGMa. We stress that 
SiGMa is able in contrast to use sophisticated similarity mea- 
sures in a scalable fashion, and had a 50x speed improvement 
over PARIS on the large-scale datasets. 

The SiGMa algorithm is related to the collective entity
resolution approach of Bhattacharya and Getoor [3], which
proposed a greedy agglomerative clustering algorithm to cluster
entities based on previous decisions. Their approach could
handle constraints on the clustering, including a 1-1 matching
constraint in theory, though it was not implemented. A
scalable solution for collective entity resolution was proposed
recently in [23], by treating the sophisticated machine learning
approaches to entity resolution as black boxes (see references
therein), but running them on small neighborhoods
and combining their output using a message-passing scheme.
They do not consider exploiting a 1-1 matching constraint
though, like most entity resolution or record linkage work.



17 The authors call these 'marginal probabilities' as they were 
motivated from probabilistic arguments, but these do not 
sum to one. 



The idea to propagate information on a relationship graph 
has been used in several other approaches for ontology match- 
ing [14, 19], though none were scalable for the size of knowl- 
edge bases that we considered. An analogous 'fire prop- 
agation' algorithm has been used to align social network 
graphs in [20], though with a very different objective function
(they define weights in each graph and want to align
edges which have similar weights). The heuristic of propagating
information on a relationship graph is related to
a well-known heuristic for solving Constraint Satisfaction
Problems known as constraint propagation [2]. Ehrig and
Staab [7] mentioned several heuristics to reduce the number 
of candidates to consider in ontology alignment, including a 
similar one to compatible-neighbors, though they tested 
their approach only on a few hundred instances. Finally, we 
mention that Peralta [22] aligned the movie database Movie- 
Lens to IMDb through a combination of steps of manual 
cleaning with some automation. SiGMa could be considered 
as an alternative which does not require manual intervention 
apart from specifying the score function to use.

6. CONCLUSION 

We have presented SiGMa, a simple and scalable algo- 
rithm for the alignment of large-scale knowledge bases. De- 
spite making greedy decisions and never backtracking to cor- 
rect decisions, SiGMa obtained a higher F-measure than the 
previously best published results on the OAEI benchmark 
datasets, and matched the performance of the more involved 
algorithm PARIS while being 50x faster on large-scale knowl- 
edge bases of millions of entities. Our experiments indicate 
that SiGMa can obtain good performance over a range of 
datasets with the same parameter setting. On the other 
hand, SiGMa is easily extensible to more powerful scoring 
functions between entities, as long as they can be efficiently 
computed. 

Some apparent limitations of SiGMa are a) that it cannot
correct previous mistakes and b) that it cannot handle
alignments other than 1-1. Addressing these in a scalable
fashion which preserves high accuracy is an open question for
future work. We note though that the non-corrective nature
of the algorithm didn't seem to be an issue in our experi- 
ments. Moreover, pre-processing each knowledge base with 
a de-duplication method can help make the 1-1 assumption 
more reasonable, which is a powerful feature to exploit in 
an alignment algorithm. Another interesting direction for 
future work would be to use machine learning methods to 
learn the parameters of more powerful scoring functions. In
particular, the 'learning to rank' model seems suitable to 
learn a score function which would rank the correctly la- 
beled matched pairs above the other ones. The current level 
of performance of SiGMa already makes it suitable though 
as a powerful generic alignment tool for knowledge bases and 
hence takes us closer to the vision of Linked Open Data and 
the Semantic Web. 
APPENDIX

A. PROPERTY SIMILARITY MEASURE 

We describe here the property similarity measure used in our
implementation. We use a smoothed weighted Jaccard similarity
measure between the sets of properties defined as follows.
Suppose that $e_1$ has properties $p_1, p_2, \ldots, p_{n_1}$ with respective literal
values $v_1, v_2, \ldots, v_{n_1}$, and that $e_2$ has properties $q_1, q_2, \ldots, q_{n_2}$
with respective literal values $l_1, l_2, \ldots, l_{n_2}$. In analogy to the
string similarity measure, we will also associate IDF weights to
the possible property values, $w^o_{p,v} = \log_{10} \frac{N^o_p}{|E^o_{p,v}|}$, where
$E^o_{p,v} = \{e \in \mathcal{E}_o : e \text{ has literal } v \text{ for property } p\}$ and $N^o_p$ is the
total number of entities in knowledge base $o$ which have a value
for property $p$. We then define the following property similarity
measure:



a=l 6=1 

where $M_{12}$ represents the property alignment: $M_{12} = \{(a,b) :
p_a \text{ is matched to } q_b\}$. $\mathrm{Sim}_{p_a,q_b}(v_a, l_b)$ is a [0,1]-valued similarity
measure between literals; it could be a normalized distance on
numbers (for dates, years, etc.), a string-edit distance on strings,
etc.
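
The IDF weights on property values can be computed in one pass over a knowledge base, as in the sketch below. It follows the definition $\log_{10}(N_p / |E_{p,v}|)$ given above and assumes, for simplicity, that each entity has at most one value per property; the data layout and names are hypothetical.

```python
import math
from collections import Counter


def idf_property_weights(entity_properties):
    """Compute IDF weights for (property, value) pairs in one knowledge base.

    entity_properties: dict mapping entity -> dict mapping property -> literal.
    Returns a dict mapping (property, value) -> log10(N_p / |E_{p,v}|).
    """
    n_p = Counter()   # number of entities having any value for property p
    e_pv = Counter()  # number of entities having value v for property p
    for props in entity_properties.values():
        for p, v in props.items():
            n_p[p] += 1
            e_pv[(p, v)] += 1
    return {(p, v): math.log10(n_p[p] / count) for (p, v), count in e_pv.items()}


kb = {
    "e1": {"bornOn": 1950},
    "e2": {"bornOn": 1950},
    "e3": {"bornOn": 1960},
    "e4": {"bornOn": 1970},
}
weights = idf_property_weights(kb)
print(weights[("bornOn", 1950)], weights[("bornOn", 1960)])
```

Rare values get a larger weight than common ones, so agreeing on an unusual literal counts for more than agreeing on a frequent one.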




Figure 3: Graph weight illustration. The contribution
of the movie match $y_{kl} = 1$ should be weighted more for
the candidate match pairing the only director $i$ of $k$ with a
director of movie $l$ ($w_{ik} = 1$) as compared to the candidate
match pairing one of the many actors $i'$ of $k$ with an actor
of the movie $l$ ($w_{i'k} = 1/2$ for two actors in movie $k$). This
weighting scheme can also be thought of as ensuring that the
contribution of the match $(k, l)$ spreads uniformly amongst
all its neighbors with one unit of influence per relationship
type in each KB separately.



B. GRAPH NEIGHBOR WEIGHT 

We recall that the graph weight $w_{ik}$ determines the strength
of the contribution of the neighbor $k$ being correctly matched to
the score of a suggested pair containing $i$. In our experiments, we
consider both the constant weight $w_{ik} = 1$ and a weight $w_{ik}$ that
varies inversely with the number of neighbors entity $k$ has where
the relationship is of the same type as the one with entity $i$. To
motivate the latter, we go back again to our running example of
Figure 1, but switching the role of $i$ and $k$ as we need to look at
the neighbors of $k$; this is illustrated in Figure 3 and explained in
its caption. In case there are multiple different relationships
linking the same pair $i$ to $k$, we take the maximum of the weights over
these (i.e. we pick the most informative relationship to weight it).
So formally, we have:

$$w_{ik} = \max_{r \text{ s.t. } (i,r,k) \in \mathcal{F}_R} \left| \{ i' : (i', r, k) \in \mathcal{F}_R \} \right|^{-1}. \qquad (12)$$
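
A direct way to compute these weights from a list of relationship facts is sketched below, assuming facts are stored as `(i, r, k)` triples; the function name and the toy entities are illustrative only.

```python
from collections import defaultdict


def neighbor_weights(facts):
    """Compute w_ik as in (12) from (i, r, k) relationship triples of one KB.

    For each relation r linking i to k, the candidate weight is the inverse of
    the number of entities i' with (i', r, k); w_ik is the maximum over r.
    """
    sources = defaultdict(set)  # (r, k) -> set of i' such that (i', r, k) is a fact
    for i, r, k in facts:
        sources[(r, k)].add(i)
    weights = {}
    for (r, k), entities in sources.items():
        w = 1.0 / len(entities)
        for i in entities:
            weights[(i, k)] = max(weights.get((i, k), 0.0), w)
    return weights


# One director and two actors for the same movie, as in Figure 3.
facts = [
    ("director1", "directed", "movie"),
    ("actor1", "actedIn", "movie"),
    ("actor2", "actedIn", "movie"),
]
w = neighbor_weights(facts)
print(w[("director1", "movie")], w[("actor1", "movie")])  # 1.0 0.5
```

If the director also acted in the movie, the `directed` relation would still give the weight 1 for that pair, since the maximum over relationship types is kept.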

We also point out that the normalization of $g_{ij}(y)$ in (8) is
made over each KB independently, in contrast with the string
and prop similarity measures (7) and (11) which are normalized
over both KBs jointly. The motivation for this is that the
neighborhood sizes in YAGO and IMDb are overly asymmetric (there is
much more information about each movie in IMDb). The separate
normalization means that as long as most of a neighborhood in 
one KB is correctly aligned, the graph score will be high. The 
information about strings and properties is more symmetric in 
the KB pairs that we consider, so a joint normalization seems 
reasonable in this case. 



C. QUADRATIC ASSIGNMENT PROBLEM 

The quadratic assignment problem is traditionally defined as
finding a bijection between $R$ facilities and $R$ locations which
minimizes the expected cost of transport between the facilities. Given
that facilities $i$ and $k$ are assigned to locations $j$ and $l$ respectively,
the cost of transport between facility $i$ and $k$ is $w_{ij,kl} = n_{ik} c_{jl}$,
where $n_{ik}$ is the expected number of units to ship between
facilities $i$ and $k$, and $c_{jl}$ is the expected cost of shipment between
locations $j$ and $l$ (depending on their distance). In its more
general form [17], the coefficients can be negative, and so there is no
major difference between minimizing and maximizing, and we see
that our optimization problem (3) is a special case of this.



D. PARAMETERS USED FOR SIGMA 

We use $\alpha = 1/3$ as the graph score tradeoff 18 in (5) and
$\beta = 0.25$ as the property score tradeoff in (6). We set the string
score smoothing term in (7) as the sum of the maximum possible
word weights in each KB ($\log |\mathcal{E}_o|$). We use 0.25 as the score
threshold for the stopping criterion (step 6 in the algorithm), and
stop considering suggestions from the inverted index on strings
when their score is below 0.75. We use as initial matching the
unambiguous exact string comparison test as described in
Section 3. We use uniform weights $w_{ik} = 1$ for the matched
neighbors contribution in the graph score (10). We use a Sim measure
on property values as used in (11) which depends on the type
of property literals: for dates and numbers, we simply use 0-1
similarity (1 when they are equal) with some processing (e.g.
for dates, we only consider the year); for secondary strings (i.e.
strings for other properties than the main string representation
of an entity), we use a weighted Jaccard measure on words as
defined in (7) but with the IDF weights derived from the strings
appearing in this property only.
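
The type-dependent Sim measure described above could be sketched as follows. Several details are assumptions made for illustration: dates are taken to be ISO-like strings starting with a four-digit year, and the string case uses a plain (unsmoothed) weighted Jaccard on lowercased words rather than the exact smoothed form of (7).

```python
def year_of(date_string):
    """Extract the leading year from an ISO-like date such as '1950-03-12' (assumed format)."""
    return date_string.strip()[:4]


def sim_date(d1, d2):
    """0-1 similarity on dates, comparing the year only."""
    return 1.0 if year_of(d1) == year_of(d2) else 0.0


def sim_number(x, y):
    """0-1 similarity on numbers: 1 when they are equal."""
    return 1.0 if x == y else 0.0


def sim_string(s1, s2, word_weight):
    """Unsmoothed weighted Jaccard on word sets.

    word_weight maps a word to its weight (e.g. an IDF weight computed on the
    strings of this property only); unknown words default to weight 1.
    """
    words1, words2 = set(s1.lower().split()), set(s2.lower().split())
    union = sum(word_weight.get(w, 1.0) for w in words1 | words2)
    inter = sum(word_weight.get(w, 1.0) for w in words1 & words2)
    return inter / union if union else 0.0


print(sim_date("1950-03-12", "1950-11-01"))    # same year
print(sim_number(180, 181))                    # different numbers
print(sim_string("John Smith", "John Smyth", {}))
```

Comparing only the year makes the date similarity tolerant to databases that disagree on (or omit) the day and month.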

E. ADDITIONAL PARAMETER 
EXPERIMENTS 

We provide here the additional parameter experiments which 
were skipped from the main text for brevity. 

E.l Exp. 5: Weighting schemes, smoothing 
and tradeoffs 

In this experiment, we explored the effect of the weighting 
scheme for the three different score components (string, property 
and graph) by trying two options per component, with precision 
/ recall curves given in Figure 4. For string and property com- 
ponents, we compared uniform weights vs. IDF weights. For the 
graph component, we compare uniform weights (which surpris- 
ingly got the best result) with the inverse number of neighbors 
weight proposed in (12). Overall, the effect for these variations 



18 This value of $\alpha$ has the nice theoretical justification that
it gives twice as much weight to the linear term as to the
quadratic term, a standard weighting scheme given that the
derivative of the quadratic yields the extra factor of two to
compensate.
Figure 4: Exp. 5: Precision/Recall curves for SiGMa
on YAGO-IMDb with different weighting configura- 
tions. The filled circles indicate the maximum F-measure 
position on each curve, with the corresponding diamond giv- 
ing the F-measure value at this recall point. Each curve is 
one of the 8 possibilities of having the weight 'off' (set to 
unity) or 'on', for the graph / property / string part of the 
score function. The legend indicates the difference between 
the reference setup (graph off / property on / string on) and 
the given curve. 

was much smaller than the one for the score component
experiment, with the biggest decrease of less than 1% F-measure
obtained by using uniform string weights instead of the IDF-scores.
We also varied the 3 smoothing parameters (one for each score
component) as well as the 2 tradeoff parameters linearly around
their chosen values: the performance does not change much for
changes of the order of 0.1-0.2 for the tradeoff parameters, and 1.5
for the smoothing parameters (it stays within a 1% range of F-measure).

E.2 Exp. 6: Stopping threshold choice 

In this experiment, we studied whether the score information 
correlated with changes in the precision / recall information, in 
order to determine a possible stopping threshold. We overlay in 
Figure 5 the precision / recall at each iteration of the algorithm 
(blue / red) with the score (in green) of the matched pair chosen 
at this iteration (as given by (5)). The vertical black dashed
lines correspond to the iterations at which the score thresholds of
0.35 and 0.25 are reached, respectively, which correlated with
a drop of precision for the current predictions (black line with
diamonds) and a leveling of the F-measure (curved dashed black
line), respectively. We note that this correlation was also observed
on all the other datasets, indicating that this threshold is robust
to dataset variations.
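
The way a score threshold ends a greedy 1-1 matching can be illustrated with a priority queue, as in the sketch below. This is a deliberate simplification and not the SiGMa algorithm itself: the scores are fixed up front, whereas in SiGMa the graph part of the score changes as neighbors get matched. The function name and the toy pairs are hypothetical.

```python
import heapq


def greedy_match(scored_pairs, threshold=0.25):
    """Greedy 1-1 matching with a stopping threshold on the score.

    scored_pairs: iterable of (score, e1, e2) candidates.
    Returns the list of accepted (e1, e2, score), in order of acceptance.
    """
    heap = [(-score, e1, e2) for score, e1, e2 in scored_pairs]  # max-heap via negation
    heapq.heapify(heap)
    used1, used2, matching = set(), set(), []
    while heap:
        neg_score, e1, e2 = heapq.heappop(heap)
        if -neg_score < threshold:
            break  # stopping criterion: best remaining score is too low
        if e1 in used1 or e2 in used2:
            continue  # one of the entities is already matched (1-1 constraint)
        used1.add(e1)
        used2.add(e2)
        matching.append((e1, e2, -neg_score))
    return matching


pairs = [(0.9, "a", "x"), (0.8, "a", "y"), (0.6, "b", "y"), (0.2, "c", "z")]
print(greedy_match(pairs))  # [('a', 'x', 0.9), ('b', 'y', 0.6)]
```

Lowering the threshold adds more (and typically less reliable) pairs to the matching, which is the precision / recall tradeoff examined in this experiment.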



Figure 5: Exp. 6: Precision/recall and score evolution
for SiGMa on the YAGO-IMDb dataset as a function
of iterations (predictions). The magenta line indicates
the proportion out of the last 1k predictions for which
we had ground truth information; the black line with diamonds
indicates the precision for these 1k predictions. The
score of the matching pair chosen at each iteration is shown 
in green; notice how the precision starts to drop when the 
score goes below 0.35 (first vertical black dashed line) and 
the F-measure starts to level when the score goes below 0.25 
(second vertical dashed line). We note that the periodic in- 
crease of the score is explained by the fact that if compatible 
neighbors are matched, the graph score part (10) of their 
neighbors can increase sufficiently to exceed the previous 
maximum score in the priority queue. 



